Runtime fault detection, fault location, and circuit recovery in an accelerator device

ABSTRACT

An apparatus to facilitate runtime fault detection, fault location, and circuit recovery in an accelerator device is disclosed. In one implementation, the accelerator device comprises a sensor network comprising a plurality of sensors; a secure device manager (SDM); and a sensor aggregator communicably coupled to the sensor network and the SDM. In one implementation, the sensor aggregator can receive sensor data from the sensor network; analyze the sensor data to detect a fault condition; determine a spatial location of the fault condition based on the sensor data; and generate an event for the SDM to cause the SDM to mitigate the fault condition.

RELATED APPLICATIONS

This application claims the benefit of priority from U.S. Provisional Patent Application Ser. No. 63/083,783 filed on Sep. 25, 2020, the full disclosure of which is incorporated herein by reference.

FIELD

This disclosure relates generally to data processing and more particularly to runtime fault detection, fault location, and circuit recovery in an accelerator device.

BACKGROUND OF THE DISCLOSURE

A programmable logic device can be configured to support a multi-tenant usage model. A multi-tenant usage model arises where a single device is provisioned by a server to support N clients. It is assumed that the clients do not trust each other, that the clients do not trust the server, and that the server does not trust the clients. The multi-tenant model is configured using a base configuration followed by an arbitrary number of partial reconfigurations (i.e., a process that changes only a subset of configuration bits while the rest of the device continues to execute). The server is typically managed by some trusted party such as a cloud service provider.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited features of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate typical embodiments and are therefore not to be considered limiting of its scope.

FIG. 1 is a diagram of an illustrative programmable integrated circuit in accordance with an embodiment.

FIG. 2 is a diagram showing how configuration data is created by a logic design system and loaded into a programmable device to configure the device for operation in a system in accordance with an embodiment.

FIG. 3 is a diagram of a circuit design system that may be used to design integrated circuits in accordance with an embodiment.

FIG. 4 is a diagram of illustrative computer-aided design (CAD) tools that may be used in a circuit design system in accordance with an embodiment.

FIG. 5 is a flow chart of illustrative steps for designing an integrated circuit in accordance with an embodiment.

FIG. 6 is a diagram of an illustrative multitenancy system in accordance with an embodiment.

FIG. 7 is a diagram of a programmable integrated circuit having a static region and multiple partial reconfiguration (PR) sandbox regions in accordance with an embodiment.

FIG. 8 illustrates a computing device employing a disaggregate compute component, according to implementations of the disclosure.

FIG. 9 illustrates a disaggregate compute component, according to one implementation of the disclosure.

FIG. 10 illustrates an accelerator device for providing runtime fault detection, location and circuit recovery, in accordance with implementations of the disclosure.

FIG. 11 illustrates an accelerator device for providing runtime fault detection, location and circuit recovery, in accordance with implementations of the disclosure.

FIG. 12 is a flow diagram illustrating a method for runtime fault detection, fault location, and circuit recovery in an accelerator device, in accordance with implementations of the disclosure.

DETAILED DESCRIPTION

Implementations are directed to runtime fault detection, fault location, and circuit recovery in an accelerator device. Disaggregated computing is on the rise in data centers. Cloud service providers (CSPs) are deploying solutions where processing of a workload is distributed on disaggregated compute resources such as CPUs and hardware accelerators, such as FPGAs, that are connected via a network instead of being on the same platform and connected via physical links such as PCIe. The compute disaggregation enables improved resource utilization and lowers Total Cost of Ownership (TCO) by making more efficient use of available resources. Disaggregation also enables pooling a large number of hardware accelerators for large computation, making the computation more efficient and performant.

In the following description, numerous specific details are set forth to provide a more thorough understanding. However, it may be apparent to one of skill in the art that the embodiments described herein may be practiced without one or more of these specific details. In other instances, well-known features have not been described to avoid obscuring the details of the present embodiments.

Various embodiments are directed to techniques for disaggregated computing for programmable integrated circuits, for instance.

System Overview

While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.

References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Additionally, it should be appreciated that items included in a list in the form of “at least one of A, B, and C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C). Similarly, items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).

The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).

In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.

Programmable integrated circuits use programmable memory elements to store configuration data. During programming of a programmable integrated circuit, configuration data is loaded into the memory elements. The memory elements may be organized in arrays having numerous rows and columns. For example, memory array circuitry may be formed in hundreds or thousands of rows and columns on a programmable logic device integrated circuit.

During normal operation of the programmable integrated circuit, each memory element is configured to provide a static output signal. The static output signals that are supplied by the memory elements serve as control signals. These control signals are applied to programmable logic on the integrated circuit to customize the programmable logic to perform a desired logic function.

It may sometimes be desirable to reconfigure only a portion of the memory elements during normal operation. This type of reconfiguration, in which only a subset of memory elements is loaded with new configuration data during runtime, is sometimes referred to as “partial reconfiguration”. During partial reconfiguration, new data should be written into a selected portion of memory elements (sometimes referred to as “memory cells”).

An illustrative programmable integrated circuit such as programmable logic device (PLD) 10 is shown in FIG. 1. As shown in FIG. 1, programmable integrated circuit 10 may have input-output circuitry 12 for driving signals off of device 10 and for receiving signals from other devices via input-output pins 14. Interconnection resources 16 such as global and local vertical and horizontal conductive lines and buses may be used to route signals on device 10. Interconnection resources 16 include fixed interconnects (conductive lines) and programmable interconnects (i.e., programmable connections between respective fixed interconnects). Programmable logic 18 may include combinational and sequential logic circuitry. The programmable logic 18 may be configured to perform a custom logic function.

Examples of programmable logic device 10 include, but are not limited to, programmable array logic (PALs), programmable logic arrays (PLAs), field programmable logic arrays (FPLAs), electrically programmable logic devices (EPLDs), electrically erasable programmable logic devices (EEPLDs), logic cell arrays (LCAs), complex programmable logic devices (CPLDs), and field programmable gate arrays (FPGAs), just to name a few. System configurations in which device 10 is a programmable logic device such as an FPGA are sometimes described as an example but are not intended to limit the scope of the present embodiments.

Programmable integrated circuit 10 contains memory elements 20 that can be loaded with configuration data (also called programming data) using pins 14 and input-output circuitry 12. Once loaded, the memory elements 20 may each provide a corresponding static control output signal that controls the state of an associated logic component in programmable logic 18. Typically, the memory element output signals are used to control the gates of metal-oxide-semiconductor (MOS) transistors. Some of the transistors may be p-channel metal-oxide-semiconductor (PMOS) transistors. Many of these transistors may be n-channel metal-oxide-semiconductor (NMOS) pass transistors in programmable components such as multiplexers. When a memory element output is high, an NMOS pass transistor controlled by that memory element can be turned on to pass logic signals from its input to its output. When the memory element output is low, the pass transistor is turned off and does not pass logic signals.
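As one non-limiting illustration, the pass-transistor behavior described above can be modeled in software. The following Python sketch assumes a 2:1 multiplexer formed from two NMOS pass transistors gated by a single memory element output and its complement; the names `nmos_pass` and `mux2` are hypothetical and are used here for illustration only.

```python
from typing import Optional


def nmos_pass(gate: int, signal: int) -> Optional[int]:
    """Model an NMOS pass transistor.

    When the gate (memory element output) is high, the input signal is
    passed to the output. When the gate is low, the transistor is off
    and the output is not driven (modeled here as None).
    """
    return signal if gate == 1 else None


def mux2(cram_bit: int, in0: int, in1: int) -> int:
    """Model a 2:1 multiplexer built from two pass transistors.

    Assumption: one transistor is gated by cram_bit and the other by
    its complement, so exactly one of them conducts at any time.
    """
    if cram_bit not in (0, 1):
        raise ValueError("cram_bit must be 0 or 1")
    passed_in1 = nmos_pass(cram_bit, in1)
    passed_in0 = nmos_pass(1 - cram_bit, in0)
    return passed_in1 if passed_in1 is not None else passed_in0
```

In this sketch, loading a 0 into the memory element routes `in0` to the output and loading a 1 routes `in1`, which mirrors how a static control signal may customize a programmable component.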

A typical memory element 20 is formed from a number of transistors configured to form cross-coupled inverters. Other arrangements (e.g., cells with more distributed inverter-like circuits) may also be used. With one suitable approach, complementary metal-oxide-semiconductor (CMOS) integrated circuit technology is used to form the memory elements 20, so CMOS-based memory element implementations are described herein as an example. In the context of programmable integrated circuits, the memory elements store configuration data and are therefore sometimes referred to as configuration random-access memory (CRAM) cells.

An illustrative system environment for device 10 is shown in FIG. 2. Device 10 may be mounted on a board 36 in a system 38. In general, programmable logic device 10 may receive configuration data from programming equipment or from other suitable equipment or device. In the example of FIG. 2, programmable logic device 10 is the type of programmable logic device that receives configuration data from an associated integrated circuit 40. With this type of arrangement, circuit 40 may, if desired, be mounted on the same board 36 as programmable logic device 10.

Circuit 40 may be an erasable-programmable read-only memory (EPROM) chip, a programmable logic device configuration data loading chip with built-in memory (sometimes referred to as a “configuration device”), or other suitable device. When system 38 boots up (or at another suitable time), the configuration data for configuring the programmable logic device may be supplied to the programmable logic device from device 40, as shown schematically by path 42. The configuration data that is supplied to the programmable logic device may be stored in the programmable logic device in its configuration random-access-memory elements 20.

System 38 may include processing circuits 44, storage 46, and other system components 48 that communicate with device 10. The components of system 38 may be located on one or more boards such as board 36 or other suitable mounting structures or housings and may be interconnected by buses, traces, and other electrical paths 50.

Configuration device 40 may be supplied with the configuration data for device 10 over a path such as path 52. Configuration device 40 may, for example, receive the configuration data from configuration data loading equipment 54 or other suitable equipment that stores this data in configuration device 40. Device 40 may be loaded with data before or after installation on board 36.

As shown in FIG. 2, the configuration data produced by a logic design system 56 may be provided to equipment 54 over a path such as path 58. The equipment 54 provides the configuration data to device 40, so that device 40 can later provide this configuration data to the programmable logic device 10 over path 42. Logic design system 56 may be based on one or more computers and one or more software programs. In general, software and data may be stored on any computer-readable medium (storage) in system 56 and is shown schematically as storage 60 in FIG. 2.

In a typical scenario, logic design system 56 is used by a logic designer to create a custom circuit design. The system 56 produces corresponding configuration data which is provided to configuration device 40. Upon power-up, configuration device 40 and data loading circuitry on programmable logic device 10 are used to load the configuration data into CRAM cells 20 of device 10. Device 10 may then be used in normal operation of system 38.

After device 10 is initially loaded with a set of configuration data (e.g., using configuration device 40), device 10 may be reconfigured by loading a different set of configuration data. Sometimes it may be desirable to reconfigure only a portion of the memory cells on device 10 via a process sometimes referred to as partial reconfiguration. As memory cells are typically arranged in an array, partial reconfiguration can be performed by writing new data values only into selected portion(s) in the array while leaving portions of the array other than the selected portion(s) in their original state.
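By way of example only, the array-based partial reconfiguration described above can be sketched in Python. The sketch assumes the memory cells are modeled as a two-dimensional list of bits and that the selected portion is a rectangle given as (first row, first column, row count, column count); the function name `partial_reconfigure` and this region encoding are hypothetical and are not taken from the figures.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (row0, col0, rows, cols), assumed encoding


def partial_reconfigure(cram: List[List[int]],
                        region: Region,
                        new_bits: List[List[int]]) -> None:
    """Write new_bits into the selected rectangle of the array in place.

    Cells outside the selected region are never touched, so they keep
    their original state.
    """
    row0, col0, rows, cols = region
    if len(new_bits) != rows or any(len(r) != cols for r in new_bits):
        raise ValueError("new_bits does not match the region shape")
    if row0 < 0 or col0 < 0 or row0 + rows > len(cram) or col0 + cols > len(cram[0]):
        raise ValueError("region lies outside the memory array")
    for r in range(rows):
        for c in range(cols):
            cram[row0 + r][col0 + c] = new_bits[r][c]


# Usage: a 4x4 array initially holding all zeros, with a 2x2 region rewritten.
cram_array = [[0] * 4 for _ in range(4)]
partial_reconfigure(cram_array, (1, 1, 2, 2), [[1, 1], [1, 1]])
```

After the call, only the four cells in the selected rectangle hold new values; the remaining twelve cells are unchanged.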

It can be a significant undertaking to design and implement a desired (custom) logic circuit in a programmable logic device. Logic designers therefore generally use logic design systems based on computer-aided-design (CAD) tools to assist them in designing circuits. A logic design system can help a logic designer design and test complex circuits for a system. When a design is complete, the logic design system may be used to generate configuration data for electrically programming the appropriate programmable logic device.

An illustrative logic circuit design system 300 in accordance with an embodiment is shown in FIG. 3. If desired, the circuit design system of FIG. 3 may be used in a logic design system such as logic design system 56 shown in FIG. 2. Circuit design system 300 may be implemented on integrated circuit design computing equipment. For example, system 300 may be based on one or more processors such as personal computers, workstations, etc. The processor(s) may be linked using a network (e.g., a local or wide area network). Memory in these computers or external memory and storage devices such as internal and/or external hard disks may be used to store instructions and data.

Software-based components such as computer-aided design tools 320 and databases 330 reside on system 300. During operation, executable software such as the software of computer aided design tools 320 runs on the processor(s) of system 300. Databases 330 are used to store data for the operation of system 300. In general, software and data may be stored on non-transitory computer readable storage media (e.g., tangible computer readable storage media). The software code may sometimes be referred to as software, data, program instructions, instructions, or code. The non-transitory computer readable storage media may include computer memory chips, non-volatile memory such as non-volatile random-access memory (NVRAM), one or more hard drives (e.g., magnetic drives or solid state drives), one or more removable flash drives or other removable media, compact discs (CDs), digital versatile discs (DVDs), Blu-ray discs (BDs), other optical media, and floppy diskettes, tapes, or any other suitable memory or storage device(s).

Software stored on the non-transitory computer readable storage media may be executed on system 300. When the software of system 300 is installed, the storage of system 300 has instructions and data that cause the computing equipment in system 300 to execute various methods (processes). When performing these processes, the computing equipment is configured to implement the functions of circuit design system 300.

The computer aided design (CAD) tools 320, some or all of which are sometimes referred to collectively as a CAD tool, a circuit design tool, or an electronic design automation (EDA) tool, may be provided by a single vendor or by multiple vendors. Tools 320 may be provided as one or more suites of tools (e.g., a compiler suite for performing tasks associated with implementing a circuit design in a programmable logic device) and/or as one or more separate software components (tools). Database(s) 330 may include one or more databases that are accessed only by a particular tool or tools and may include one or more shared databases. Shared databases may be accessed by multiple tools. For example, a first tool may store data for a second tool in a shared database. The second tool may access the shared database to retrieve the data stored by the first tool. This allows one tool to pass information to another tool. Tools may also pass information between each other without storing information in a shared database if desired.

Illustrative computer aided design tools 420 that may be used in a circuit design system such as circuit design system 300 of FIG. 3 are shown in FIG. 4.

The design process may start with the formulation of functional specifications of the integrated circuit design (e.g., a functional or behavioral description of the integrated circuit design). A circuit designer may specify the functional operation of a desired circuit design using design and constraint entry tools 464. Design and constraint entry tools 464 may include tools such as design and constraint entry aid 466 and design editor 468. Design and constraint entry aids such as aid 466 may be used to help a circuit designer locate a desired design from a library of existing circuit designs and may provide computer-aided assistance to the circuit designer for entering (specifying) the desired circuit design.

As an example, design and constraint entry aid 466 may be used to present screens of options for a user. The user may click on on-screen options to select whether the circuit being designed should have certain features. Design editor 468 may be used to enter a design (e.g., by entering lines of hardware description language code), may be used to edit a design obtained from a library (e.g., using a design and constraint entry aid), or may assist a user in selecting and editing appropriate prepackaged code/designs.

Design and constraint entry tools 464 may be used to allow a circuit designer to provide a desired circuit design using any suitable format. For example, design and constraint entry tools 464 may include tools that allow the circuit designer to enter a circuit design using truth tables. Truth tables may be specified using text files or timing diagrams and may be imported from a library. Truth table circuit design and constraint entry may be used for a portion of a large circuit or for an entire circuit.

As another example, design and constraint entry tools 464 may include a schematic capture tool. A schematic capture tool may allow the circuit designer to visually construct integrated circuit designs from constituent parts such as logic gates and groups of logic gates. Libraries of preexisting integrated circuit designs may be used to allow a desired portion of a design to be imported with the schematic capture tools.

If desired, design and constraint entry tools 464 may allow the circuit designer to provide a circuit design to the circuit design system 300 using a hardware description language such as Verilog hardware description language (Verilog HDL), Very High Speed Integrated Circuit Hardware Description Language (VHDL), SystemVerilog, or a higher-level circuit description language such as OpenCL or SystemC, just to name a few. The designer of the integrated circuit design can enter the circuit design by writing hardware description language code with editor 468. Blocks of code may be imported from user-maintained or commercial libraries if desired.

After the design has been entered using design and constraint entry tools 464, behavioral simulation tools 472 may be used to simulate the functionality of the circuit design. If the functionality of the design is incomplete or incorrect, the circuit designer can make changes to the circuit design using design and constraint entry tools 464. The functional operation of the new circuit design may be verified using behavioral simulation tools 472 before synthesis operations have been performed using tools 474. Simulation tools such as behavioral simulation tools 472 may also be used at other stages in the design flow if desired (e.g., after logic synthesis). The output of the behavioral simulation tools 472 may be provided to the circuit designer in any suitable format (e.g., truth tables, timing diagrams, etc.).

Once the functional operation of the circuit design has been determined to be satisfactory, logic synthesis and optimization tools 474 may generate a gate-level netlist of the circuit design, for example using gates from a particular library pertaining to a targeted process supported by a foundry, which has been selected to produce the integrated circuit. Alternatively, logic synthesis and optimization tools 474 may generate a gate-level netlist of the circuit design using gates of a targeted programmable logic device (i.e., in the logic and interconnect resources of a particular programmable logic device product or product family).

Logic synthesis and optimization tools 474 may optimize the design by making appropriate selections of hardware to implement different logic functions in the circuit design based on the circuit design data and constraint data entered by the logic designer using tools 464. As an example, logic synthesis and optimization tools 474 may perform multi-level logic optimization and technology mapping based on the length of a combinational path between registers in the circuit design and corresponding timing constraints that were entered by the logic designer using tools 464.

After logic synthesis and optimization using tools 474, the circuit design system may use tools such as placement, routing, and physical synthesis tools 476 to perform physical design steps (layout synthesis operations). Tools 476 can be used to determine where to place each gate of the gate-level netlist produced by tools 474. For example, if two counters interact with each other, tools 476 may locate these counters in adjacent regions to reduce interconnect delays or to satisfy timing requirements specifying the maximum permitted interconnect delay. Tools 476 create orderly and efficient implementations of circuit designs for any targeted integrated circuit (e.g., for a given programmable integrated circuit such as an FPGA).

Tools such as tools 474 and 476 may be part of a compiler suite (e.g., part of a suite of compiler tools provided by a programmable logic device vendor). In certain embodiments, tools such as tools 474, 476, and 478 may also include timing analysis tools such as timing estimators. This allows tools 474 and 476 to satisfy performance requirements (e.g., timing requirements) before actually producing the integrated circuit.

After an implementation of the desired circuit design has been generated using tools 476, the implementation of the design may be analyzed and tested using analysis tools 478. For example, analysis tools 478 may include timing analysis tools, power analysis tools, or formal verification tools, just to name a few.

After satisfactory optimization operations have been completed using tools 420 and depending on the targeted integrated circuit technology, tools 420 may produce a mask-level layout description of the integrated circuit or configuration data for programming the programmable logic device.

Illustrative operations involved in using tools 420 of FIG. 4 to produce the mask-level layout description of the integrated circuit are shown in FIG. 5. As shown in FIG. 5, a circuit designer may first provide a design specification 502. The design specification 502 may, in general, be a behavioral description provided in the form of application code (e.g., C code, C++ code, SystemC code, OpenCL code, etc.). In some scenarios, the design specification may be provided in the form of a register transfer level (RTL) description 506.

The RTL description may have any form of describing circuit functions at the register transfer level. For example, the RTL description may be provided using a hardware description language such as the Verilog hardware description language (Verilog HDL or Verilog), the SystemVerilog hardware description language (SystemVerilog HDL or SystemVerilog), or the Very High Speed Integrated Circuit Hardware Description Language (VHDL). If desired, a portion or all of the RTL description may be provided as a schematic representation or in the form of code using OpenCL, MATLAB, Simulink, or other high-level synthesis (HLS) language.

In general, the behavioral design specification 502 may include untimed or partially timed functional code (i.e., the application code does not describe cycle-by-cycle hardware behavior), whereas the RTL description 506 may include a fully timed design description that details the cycle-by-cycle behavior of the circuit at the register transfer level.

Design specification 502 or RTL description 506 may also include target criteria such as area use, power consumption, delay minimization, clock frequency optimization, or any combination thereof. The optimization constraints and target criteria may be collectively referred to as constraints.

Those constraints can be provided for individual data paths, portions of individual data paths, portions of a design, or for the entire design. For example, the constraints may be provided with the design specification 502, the RTL description 506 (e.g., as a pragma or as an assertion), in a constraint file, or through user input (e.g., using the design and constraint entry tools 464 of FIG. 4), to name a few.

At step 504, behavioral synthesis (sometimes also referred to as algorithmic synthesis) may be performed to convert the behavioral description into an RTL description 506. Step 504 may be skipped if the design specification is already provided in the form of an RTL description.

At step 518, behavioral simulation tools 472 may perform an RTL simulation of the RTL description, which may verify the functionality of the RTL description. If the functionality of the RTL description is incomplete or incorrect, the circuit designer can make changes to the HDL code (as an example). During RTL simulation 518, actual results obtained from simulating the behavior of the RTL description may be compared with expected results.
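As a purely illustrative sketch of the comparison described above, the following Python code steps a cycle-by-cycle software model of a small free-running counter and compares the resulting trace against expected values. The counter, the function names `simulate_counter` and `find_mismatches`, and the trace format are assumptions chosen for illustration; they do not represent the behavior of simulation tools 472.

```python
from typing import List


def simulate_counter(width: int, cycles: int) -> List[int]:
    """Return the counter value observed at each clock cycle.

    The model is fully timed in the sense that one list entry is
    produced per cycle, and the counter wraps at 2**width.
    """
    value = 0
    trace = []
    for _ in range(cycles):
        trace.append(value)
        value = (value + 1) % (1 << width)
    return trace


def find_mismatches(actual: List[int], expected: List[int]) -> List[int]:
    """Return the cycle indices at which actual and expected differ."""
    if len(actual) != len(expected):
        raise ValueError("traces must cover the same number of cycles")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Usage: a 2-bit counter simulated for six cycles against expected results.
actual_trace = simulate_counter(width=2, cycles=6)
mismatches = find_mismatches(actual_trace, [0, 1, 2, 3, 0, 1])
```

An empty mismatch list would indicate that the simulated behavior agrees with the expected results for every cycle; a non-empty list identifies the cycles a designer may wish to inspect.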

During step 508, logic synthesis operations may generate gate-level description 510 using logic synthesis and optimization tools 474 from FIG. 4. The output of logic synthesis 508 is a gate-level description 510 of the design.

During step 512, placement operations using for example placement tools 476 of FIG. 4 may place the different gates in gate-level description 510 in a preferred location on the targeted integrated circuit to meet given target criteria (e.g., minimize area and maximize routing efficiency or minimize path delay and maximize clock frequency or minimize overlap between logic elements, or any combination thereof). The output of placement 512 is a placed gate-level description 513, which satisfies the legal placement constraints of the underlying target device.

During step 515, routing operations using for example routing tools 476 of FIG. 4 may connect the gates from the placed gate-level description 513. Routing operations may attempt to meet given target criteria (e.g., minimize congestion, minimize path delay and maximize clock frequency, satisfy minimum delay requirements, or any combination thereof). The output of routing 515 is a mask-level layout description 516 (sometimes referred to as routed gate-level description 516). The mask-level layout description 516 generated by the design flow of FIG. 5 may sometimes be referred to as a device configuration bit stream or a device configuration image.

While placement and routing are being performed at steps 512 and 515, physical synthesis operations 517 may be concurrently performed to further modify and optimize the circuit design (e.g., using physical synthesis tools 476 of FIG. 4).

Multi-Tenant Usage

In implementations of the disclosure, programmable integrated circuit device 10 may be configured using tools described in FIGS. 2-5 to support a multi-tenant usage model or scenario. As noted above, examples of programmable logic devices include programmable array logic (PALs), programmable logic arrays (PLAs), field programmable logic arrays (FPLAs), electrically programmable logic devices (EPLDs), electrically erasable programmable logic devices (EEPLDs), logic cell arrays (LCAs), complex programmable logic devices (CPLDs), and field programmable gate arrays (FPGAs), just to name a few. System configurations in which device 10 is a programmable logic device such as an FPGA are sometimes described as an example but are not intended to limit the scope of the present embodiments.

In accordance with an embodiment, FIG. 6 is a diagram of a multitenancy system such as system 600. As shown in FIG. 6, system 600 may include at least a host platform provider 602 (e.g., a server, a cloud service provider or “CSP”), a programmable integrated circuit device 10 such as an FPGA, and multiple tenants 604 (sometimes referred to as “clients”). The CSP 602 may interact with FPGA 10 via communications path 680 and may, in parallel, interact with tenants 604 via communications path 682. The FPGA 10 may separately interact with tenants 604 via communications path 684. In a multitenant usage model, FPGA 10 may be provisioned by the CSP 602 to support each of various tenants/clients 604 running their own separate applications. It may be assumed that the tenants do not trust each other, that the clients do not trust the CSP, and that the CSP does not trust the tenants.

The FPGA 10 may include a secure device manager (SDM) 650 that acts as a configuration manager and security enclave for the FPGA 10. The SDM 650 can conduct reconfiguration and security functions for the FPGA 10. For example, the SDM 650 can conduct functions including, but not limited to, sectorization, PUF key protection, key management, hard encrypt/authenticate engines, and zeroization. Additionally, environmental sensors (not shown) of the FPGA 10 that monitor voltage and temperature can be controlled by the SDM 650. Furthermore, device maintenance functions, such as secure return material authorization (RMA) without revealing encryption keys, secure debug of designs and ARM code, and secure key management, are additional functions enabled by the SDM 650.

Cloud service provider 602 may provide cloud services accelerated on one or more accelerator devices such as application-specific integrated circuits (ASICs), graphics processor units (GPUs), and FPGAs to multiple cloud customers (i.e., tenants). In the context of an FPGA-as-a-service usage model, cloud service provider 602 may offload more than one workload to an FPGA 10 so that multiple tenant workloads may run simultaneously on the FPGA as different partial reconfiguration (PR) workloads. In such scenarios, FPGA 10 can provide security assurances and PR workload isolation when security-sensitive workloads (or payloads) are executed on the FPGA.

Cloud service provider 602 may define a multitenancy mode (MTM) sharing and allocation policy 610. The MTM sharing and allocation policy 610 may set forth a base configuration bitstream such as base static image 612, a partial reconfiguration region whitelist such as PR whitelist 614, peek and poke vectors 616, timing and energy constraints 618 (e.g., timing and power requirements for each potential tenant or the overall multitenant system), deterministic data assets 620 (e.g., a hash list of binary assets or other reproducible component that can be used to verify the proper loading of tenant workloads into each PR region), etc. Policy 610 is sometimes referred to as an FPGA multitenancy mode contract. One or more components of MTM sharing and allocation policy 610 such as the base static image 612, PR region whitelist 614, and peek/poke vectors 616 may be generated by the cloud service provider using design tools 420 of FIG. 4.
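As an illustration of how a hash list of binary assets might be used to verify that a tenant workload was loaded into its PR region, the following is a minimal sketch. The use of SHA-256, the region names, and the functions digest and verify_workload are assumptions made for illustration only and are not specified by the disclosure.

```python
import hashlib
from typing import Dict, Set


def digest(blob: bytes) -> str:
    """Hex digest of a binary asset (SHA-256 is an assumed choice)."""
    return hashlib.sha256(blob).hexdigest()


def verify_workload(region: str, bitstream: bytes,
                    hash_list: Dict[str, Set[str]]) -> bool:
    """Return True only if the bitstream digest is listed for this PR region.

    hash_list is a hypothetical stand-in for deterministic data assets 620:
    it maps a PR region name to the digests approved for that region.
    """
    return digest(bitstream) in hash_list.get(region, set())
```

In this sketch an unknown region has no approved digests, so any workload offered for it is rejected rather than accepted by default.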

The base static image 612 may define a base design for device 10 (see, e.g., FIG. 7). As shown in FIG. 7, the base static image 612 may define the input-output interfaces 704, one or more static region(s) 702, and multiple partial reconfiguration (PR) regions, each of which may be assigned to a respective tenant to support an isolated workload. Static region 702 may be a region where all parties agree that the configuration bits cannot be changed by partial reconfiguration. For example, static region 702 may be owned by the server/host/CSP. Any resource on device 10 should be assigned either to static region 702 or one of the PR regions (but not both).
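The rule that each resource belongs either to the static region or to exactly one PR region can be expressed as a simple consistency check. The sketch below is illustrative only; the resource identifiers, the region names, and the function check_exclusive_ownership are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def check_exclusive_ownership(
    all_resources: Iterable[str],
    regions: Dict[str, Set[str]],
) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Find resources that violate the one-owner rule.

    regions maps a region name (the static region or a PR region) to the
    resource identifiers assigned to it. Returns (unassigned, shared), where
    unassigned holds resources with no owner and shared maps each resource
    with more than one owner to the sorted list of its owners.
    """
    owners: Dict[str, List[str]] = {}
    for name, resources in regions.items():
        for res in resources:
            owners.setdefault(res, []).append(name)
    unassigned = set(all_resources) - set(owners)
    shared = {r: sorted(n) for r, n in owners.items() if len(n) > 1}
    return unassigned, shared
```

A base design would pass this check when both returned collections are empty.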

The PR region whitelist 614 may define a list of available PR regions 630 (see FIG. 6). Each PR region for housing a particular tenant may be referred to as a PR “sandbox,” in the sense of providing a trusted execution environment (TEE) for providing spatial/physical isolation and preventing potential undesired interference among the multiple tenants. Each PR sandbox may provide assurance that the contained PR tenant workload (sometimes referred to as the PR client persona) is limited to configuring its designated subset of the FPGA fabric and is protected from access by other PR workloads. The precise allocation of the PR sandbox regions and the boundaries 660 of each PR sandbox may also be defined by the base static image. Additional reserved padding area such as area 706 in FIG. 7 may be used to avoid electrical interference and coupling effects such as crosstalk. Additional circuitry may also be formed in padding area 706 for actively detecting and/or compensating for unwanted effects generated as a result of electrical interference, noise, or power surge.

Any wires such as wires 662 crossing a PR sandbox boundary may be assigned to either an associated PR sandbox or to the static region 702. If a boundary-crossing wire 662 is assigned to a PR sandbox region, routing multiplexers outside that sandbox region controlling the wire should be marked as not to be used. If a boundary-crossing wire 662 is assigned to the static region, the routing multiplexers inside that sandbox region controlling the wire should be marked as not belonging to that sandbox region (e.g., these routing multiplexers should be removed from a corresponding PR region mask).
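The two cases for a boundary-crossing wire can be sketched as follows. This is an illustrative model only: the Mux type, the string labels "sandbox" and "static", and the representation of a PR region mask as a set of multiplexer names are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Mux:
    """Hypothetical routing multiplexer that can drive the crossing wire."""
    name: str
    inside_sandbox: bool


def apply_wire_assignment(owner: str, muxes: Iterable[Mux],
                          pr_mask: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (do_not_use, updated_pr_mask) for one boundary-crossing wire.

    owner == "sandbox": muxes outside the sandbox are marked as not to be used.
    owner == "static":  muxes inside the sandbox are removed from the PR mask.
    """
    muxes = list(muxes)
    if owner == "sandbox":
        return {m.name for m in muxes if not m.inside_sandbox}, set(pr_mask)
    if owner == "static":
        inside = {m.name for m in muxes if m.inside_sandbox}
        return set(), set(pr_mask) - inside
    raise ValueError("owner must be 'sandbox' or 'static'")
```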

Any hard (non-reconfigurable) embedded intellectual property (IP) blocks such as memory blocks (e.g., random-access memory blocks) or digital signal processing (DSP) blocks that are formed on FPGA 10 may also be assigned either to a PR sandbox or to the static region. In other words, any given hard IP functional block should be completely owned by a single entity (e.g., any fabric configuration for a respective embedded functional block is either allocated to a corresponding PR sandbox or the static region).

Disaggregated Compute in Programmable Integrated Circuits

As previously described, disaggregated computing is on the rise in data centers. CSPs are deploying solutions where processing of a workload is distributed on disaggregated compute resources such as CPUs and hardware accelerators, such as FPGAs, that are connected via a network instead of being on the same platform, connected via physical links such as PCIe. The compute disaggregation enables improved resource utilization and lowers Total Cost of Ownership (TCO) by making more efficient use of available resources. Disaggregation also enables pooling a large number of hardware accelerators for large computation, making the computation more efficient and performant.

Embodiments provide for a novel technique for disaggregated computing in programmable integrated circuits, such as the programmable logic devices described above with respect to FIGS. 1-7. This novel technique is used to provide for the above-noted improved computation efficiency and performance in computing architectures seeking to implement disaggregate computing. Implementations of the disclosure provide runtime fault detection, fault location, and circuit recovery in an accelerator device, as discussed further below with respect to FIGS. 8-12.

FIG. 8 illustrates a computing device 800 employing a disaggregate compute component 810 according to one implementation of the disclosure. Computing device 800 represents a communication and data processing device including or representing (without limitations) smart voice command devices, intelligent personal assistants, home/office automation systems, home appliances (e.g., washing machines, television sets, etc.), mobile devices (e.g., smartphones, tablet computers, etc.), gaming devices, handheld devices, wearable devices (e.g., smartwatches, smart bracelets, etc.), virtual reality (VR) devices, head-mounted displays (HMDs), Internet of Things (IoT) devices, laptop computers, desktop computers, server computers, set-top boxes (e.g., Internet based cable television set-top boxes, etc.), global positioning system (GPS)-based devices, automotive infotainment devices, etc.

In some embodiments, computing device 800 includes or works with or is embedded in or facilitates any number and type of other smart devices, such as (without limitation) autonomous machines or artificially intelligent agents, such as mechanical agents or machines, electronics agents or machines, virtual agents or machines, electromechanical agents or machines, etc. Examples of autonomous machines or artificially intelligent agents may include (without limitation) robots, autonomous vehicles (e.g., self-driving cars, self-flying planes, self-sailing boats, etc.), autonomous equipment (e.g., self-operating construction vehicles, self-operating medical equipment, etc.), and/or the like. Further, “autonomous vehicles” are not limited to automobiles but may include any number and type of autonomous machines, such as robots, autonomous equipment, household autonomous devices, and/or the like, and any one or more tasks or operations relating to such autonomous machines may be interchangeably referenced with autonomous driving.

Further, for example, computing device 800 may include a computer platform hosting an integrated circuit (“IC”), such as a system on a chip (“SoC” or “SOC”), integrating various hardware and/or software components of computing device 800 on a single chip.

As illustrated, in one embodiment, computing device 800 may include any number and type of hardware and/or software components, such as (without limitation) graphics processing unit (“GPU” or simply “graphics processor”) 816, graphics driver (also referred to as “GPU driver”, “graphics driver logic”, “driver logic”, user-mode driver (UMD), user-mode driver framework (UMDF), or simply “driver”) 815, central processing unit (“CPU” or simply “application processor”) 812, hardware accelerator 814 (such as programmable logic device 10 described above with respect to FIGS. 1-7 including, but not limited to, an FPGA, ASIC, a re-purposed CPU, or a re-purposed GPU, for example), memory 808, network devices, drivers, or the like, as well as input/output (I/O) sources 804, such as touchscreens, touch panels, touch pads, virtual or regular keyboards, virtual or regular mice, ports, connectors, etc. Computing device 800 may include operating system (OS) 806 serving as an interface between hardware and/or physical resources of the computing device 800 and a user.

It is to be appreciated that a lesser or more equipped system than the example described above may be utilized for certain implementations. Therefore, the configuration of computing device 800 may vary from implementation to implementation depending upon numerous factors, such as price constraints, performance requirements, technological improvements, or other circumstances.

Embodiments may be implemented as any or a combination of: one or more microchips or integrated circuits interconnected using a parent board, hardwired logic, software stored by a memory device and executed by a microprocessor, firmware, an application specific integrated circuit (ASIC), and/or a field programmable gate array (FPGA). The terms “logic”, “module”, “component”, “engine”, “circuitry”, “element”, and “mechanism” may include, by way of example, software, hardware and/or a combination thereof, such as firmware.

In one embodiment, as illustrated, disaggregate compute component 810 may be hosted by memory 808 in communication with I/O source(s) 804, such as microphones, speakers, etc., of computing device 800. In another embodiment, disaggregate compute component 810 may be part of or hosted by operating system 806. In yet another embodiment, disaggregate compute component 810 may be hosted or facilitated by graphics driver 815. In yet another embodiment, disaggregate compute component 810 may be hosted by or part of a hardware accelerator 814; for example, disaggregate compute component 810 may be embedded in or implemented as part of the processing hardware of hardware accelerator 814, such as in the form of disaggregate compute component 840. In yet another embodiment, disaggregate compute component 810 may be hosted by or part of graphics processing unit (“GPU” or simply “graphics processor”) 816 or firmware of graphics processor 816; for example, disaggregate compute component 810 may be embedded in or implemented as part of the processing hardware of graphics processor 816, such as in the form of disaggregate compute component 830. Similarly, in yet another embodiment, disaggregate compute component 810 may be hosted by or part of central processing unit (“CPU” or simply “application processor”) 812; for example, disaggregate compute component 810 may be embedded in or implemented as part of the processing hardware of application processor 812, such as in the form of disaggregate compute component 820. In some embodiments, disaggregate compute component 810 may be provided by one or more processors including one or more of a graphics processor, an application processor, and another processor, wherein the one or more processors are co-located on a common semiconductor package.

It is contemplated that embodiments are not limited to a certain implementation or hosting of disaggregate compute component 810 and that one or more portions or components of disaggregate compute component 810 may be employed or implemented as hardware, software, or any combination thereof, such as firmware. In one embodiment, for example, the disaggregate compute component may be hosted by a machine learning processing unit which is different from the GPU. In another embodiment, the disaggregate compute component may be distributed between a machine learning processing unit and a CPU. In another embodiment, the disaggregate compute component may be distributed between a machine learning processing unit, a CPU and a GPU. In another embodiment, the disaggregate compute component may be distributed between a machine learning processing unit, a CPU, a GPU, and a hardware accelerator.

Computing device 800 may host network interface device(s) to provide access to a network, such as a LAN, a wide area network (WAN), a metropolitan area network (MAN), a personal area network (PAN), Bluetooth, a cloud network, a mobile network (e.g., 3rd Generation (3G), 4th Generation (4G), etc.), an intranet, the Internet, etc. Network interface(s) may include, for example, a wireless network interface having an antenna, which may represent one or more antenna(s). Network interface(s) may also include, for example, a wired network interface to communicate with remote devices via network cable, which may be, for example, an Ethernet cable, a coaxial cable, a fiber optic cable, a serial cable, or a parallel cable.

Embodiments may be provided, for example, as a computer program product which may include one or more machine-readable media having stored thereon machine executable instructions that, when executed by one or more machines such as a computer, network of computers, or other electronic devices, may result in the one or more machines carrying out operations in accordance with embodiments described herein. A machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (Compact Disc-Read Only Memories), and magneto-optical disks, ROMs, RAMs, EPROMs (Erasable Programmable Read Only Memories), EEPROMs (Electrically Erasable Programmable Read Only Memories), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions.

Moreover, embodiments may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of one or more data signals embodied in and/or modulated by a carrier wave or other propagation medium via a communication link (e.g., a modem and/or network connection).

Throughout the document, the term “user” may be interchangeably referred to as “viewer”, “observer”, “speaker”, “person”, “individual”, “end-user”, and/or the like. It is to be noted that throughout this document, terms like “graphics domain” may be referenced interchangeably with “graphics processing unit”, “graphics processor”, or simply “GPU” and similarly, “CPU domain” or “host domain” may be referenced interchangeably with “central processing unit”, “application processor”, or simply “CPU”.

It is to be noted that terms like “node”, “computing node”, “server”, “server device”, “cloud computer”, “cloud server”, “cloud server computer”, “machine”, “host machine”, “device”, “computing device”, “computer”, “computing system”, and the like, may be used interchangeably throughout this document. It is to be further noted that terms like “application”, “software application”, “program”, “software program”, “package”, “software package”, and the like, may be used interchangeably throughout this document. Also, terms like “job”, “input”, “request”, “message”, and the like, may be used interchangeably throughout this document.

FIG. 9 illustrates disaggregate compute component 810 of FIG. 8, according to one implementation of the disclosure. For brevity, many of the details already discussed with reference to FIG. 8 are not repeated or discussed hereafter. In one embodiment, disaggregate compute component 810 may be the same as any of disaggregate compute components 810, 820, 830, 840 described with respect to FIG. 8 and may include any number and type of components, such as (without limitations): runtime recovery component 903.

Computing device 800 is further shown to include user interface 919 (e.g., graphical user interface (GUI) based user interface, Web browser, cloud-based platform user interface, software application-based user interface, other user or application programming interfaces (APIs), etc.). Computing device 800 may further include I/O source(s) 804 having input component(s) 931, such as camera(s) 942 (e.g., Intel® RealSense™ camera), sensors, microphone(s) 941, etc., and output component(s) 933, such as display device(s) or simply display(s) 944 (e.g., integral displays, tensor displays, projection screens, display screens, etc.), speaker device(s) or simply speaker(s), etc.

Computing device 800 is further illustrated as having access to and/or being in communication with one or more database(s) 925 and/or one or more of other computing devices over one or more communication medium(s) 930 (e.g., networks such as a proximity network, a cloud network, the Internet, etc.).

In some embodiments, database(s) 925 may include one or more of storage mediums or devices, repositories, data sources, etc., having any amount and type of information, such as data, metadata, etc., relating to any number and type of applications, such as data and/or metadata relating to one or more users, physical locations or areas, applicable laws, policies and/or regulations, user preferences and/or profiles, security and/or authentication data, historical and/or other details, and/or the like.

As aforementioned, computing device 800 may host I/O sources 804 including input component(s) 931 and output component(s) 933. In one embodiment, input component(s) 931 may include a sensor array including, but not limited to, microphone(s) 941 (e.g., ultrasound microphones), camera(s) 942 (e.g., two-dimensional (2D) cameras, three-dimensional (3D) cameras, infrared (IR) cameras, depth-sensing cameras, etc.), capacitors, radio components, radar components, scanners, and/or accelerometers, etc. Similarly, output component(s) 933 may include any number and type of display device(s) 944, projectors, light-emitting diodes (LEDs), speaker(s) 943, and/or vibration motors, etc.

As aforementioned, terms like “logic”, “module”, “component”, “engine”, “circuitry”, “element”, and “mechanism” may include, by way of example, software or hardware and/or a combination thereof, such as firmware. For example, logic may itself be or include or be associated with circuitry at one or more devices, such as disaggregate compute component 820, disaggregate compute component 830, and/or disaggregate compute component 840 hosted by application processor 812, graphics processor 816, and/or hardware accelerator 814, respectively, of FIG. 8, to facilitate or execute the corresponding logic to perform certain tasks.

For example, as illustrated, input component(s) 931 may include any number and type of microphone(s) 941, such as multiple microphones or a microphone array, such as ultrasound microphones, dynamic microphones, fiber optic microphones, laser microphones, etc. It is contemplated that one or more of microphone(s) 941 serve as one or more input devices for accepting or receiving audio inputs (such as human voice) into computing device 800 and converting this audio or sound into electrical signals. Similarly, it is contemplated that one or more of camera(s) 942 serve as one or more input devices for detecting and capturing images and/or videos of scenes, objects, etc., and provide the captured data as video inputs into computing device 800.

Embodiments provide for a novel technique for disaggregate computing for distributed confidential computing environments. This novel technique is used to provide for the above-noted improved computation efficiency and performance in computing architectures seeking to implement disaggregate computing. Implementations of the disclosure utilize a disaggregate compute component 810 to provide runtime fault detection, fault location, and circuit recovery in an accelerator device.

With respect to FIG. 9, the disaggregate compute component 810 includes runtime recovery component 903 to perform the disaggregated computing for programmable integrated circuits of the disaggregate compute component 810 described herein. Further details of runtime recovery component 903 are described below with respect to FIGS. 10-12.

Runtime Fault Detection, Fault Location, and Circuit Recovery in an Accelerator Device

In some embodiments, an apparatus, system, or process is to provide runtime fault detection, fault location, and circuit recovery in an accelerator device. In one implementation, runtime recovery component 903 described with respect to FIG. 9 provides the runtime fault detection, fault location, and circuit recovery in an accelerator device.

The accelerator device could be a GPU, a re-purposed GPU, a re-purposed CPU, an ASIC, or an FPGA, to name a few examples. In implementations of the disclosure, an FPGA is specifically discussed. However, any type of hardware accelerator device and/or programmable logic integrated circuit (IC) (also referred to as a programmable IC) may utilize implementations of the disclosure, and implementations are not specifically limited to utilization in an FPGA environment. Examples of programmable logic ICs include programmable array logic (PALs), programmable logic arrays (PLAs), field programmable logic arrays (FPLAs), electrically programmable logic devices (EPLDs), electrically erasable programmable logic devices (EEPLDs), logic cell arrays (LCAs), complex programmable logic devices (CPLDs), and field programmable gate arrays (FPGAs), just to name a few. However, for ease of discussion and illustration, the specific example of an FPGA is described herein.

Emerging FPGA usage models in the data center where FPGA resources are shared between mutually untrusting parties are subject to fault injection attacks. A malicious party may deliberately construct a design that implements logic to mount a remote fault injection attack on the device, the effects of which can range from single bit flips in neighboring logic to Denial-of-Service to catastrophic damage to the platform (permanent Denial-of-Service). Related trends, such as edge cloud, are driving security requirements around physical fault injection attack resistance where faults injected via laser or electro-magnetic (EM) transients are now in scope.

In some conventional systems, fault injection sensors are deployed on the device to detect fault injection scenarios. For example, fault injection sensors have been deployed in previous solutions, such as smart cards and content protection SoCs. However, these deployments in conventional systems are typically single sensor or small scale sensor deployments.

Drawbacks of the single sensor or small scale sensor deployments of conventional systems include the inability of these deployments to spatially locate an attack source. Furthermore, the single or small scale sensor deployments are unable to leverage a network of sensors in order to improve fault detection performance, reduce false positives, and/or build a reliable sensing system.

Furthermore, conventional systems implement a static approach to mitigation of detected faults. For example, conventional systems implement static approaches including off-line validators, virus scanners, and/or design rule checking (DRC) components to mitigate detected faults. However, the creation of components for detection and mitigation of faults in such a static manner is difficult in the face of a dedicated attacker. For example, against physical fault injection attacks, such as laser and EM fault injection, such static approaches are not sufficient in terms of timely and effectively addressing the fault injection attacks.

Implementations of the disclosure address the above-noted technical drawbacks by providing for runtime fault detection, fault location, and circuit recovery in an accelerator device. Implementations provide a network of sensors in an accelerator device, where the network of sensors is capable of monitoring timing margins. Implementations of the disclosure further include an infrastructure for delivering sensor data to components of the accelerator device for timely responses to detected faults and for purposes of spatially locating the fault on the accelerator device. Lastly, implementations of the disclosure include the accelerator device implementing a security policy to address the timely and spatially-located faults on the accelerator device.

Advantages of implementations of the disclosure include that runtime, on-device mitigation of faults is superior to static approaches. The runtime, on-device detection and mitigation of fault injection attacks provided by the network of sensors and fault detection/mitigation infrastructure of implementations of the disclosure provide for improved (e.g., quicker) detection of fault injection attacks (e.g., during runtime), more precise detection of spatial location of faults on an accelerator device, and more secure mitigation of such detected faults.

FIG. 10 illustrates an accelerator device 1000 for providing runtime fault detection, location and circuit recovery, in accordance with implementations of the disclosure. In some implementations, the accelerator device 1000 could be a GPU, a re-purposed GPU, a re-purposed CPU, an ASIC, or an FPGA, to name a few examples. In one implementation, accelerator device 1000 is an FPGA, which may be the same as FPGA 10 described above with respect to FIGS. 1-7.

As illustrated, accelerator device 1000 includes a network of sensors 1006 distributed among the network-on-chip (NoC) infrastructure 1002 of the accelerator device 1000. The accelerator device 1000 may include a plurality of sectors indicated by sector boundaries 1004. Each sector may host a tenant 1010. In some implementations, the tenants 1010 are each PR bitstream tenants of the FPGA. Each sector may have one or more sensors 1006 associated with the sector.

In one implementation, one of the tenants is a malicious tenant 1015 performing a fault injection attack on the accelerator device 1000. For example, malicious tenant 1015 may perform a permanent denial of service (PDoS) 1008 fault injection attack on accelerator device 1000. A PDoS attack 1008 refers to a denial of service via hardware sabotage. One method of conducting a PDoS attack 1008 is commonly referred to as phlashing. During such an attack, an attacker bricks a device or destroys firmware, rendering the device or an entire system useless.

In implementations of the disclosure, the network of sensors 1006 may provide a variety of information including, but not limited to, ECC statistics 1021, voltage/current/temperature sensing data 1022, sensor data 1023, and tamper detect data 1024 to an anomaly detection component 1020 of the accelerator device 1000. Based on the provided information from the network of sensors 1006, the anomaly detection component 1020 can generate a mitigation decision 1030. The mitigation decision 1030 may be a response to a detected fault and can include, but is not limited to, decisions such as locating the fault, tear-down of a sector of the device, built-in self-test (BIST) of the device, and/or re-allocating workload on the device, for example.

FIG. 11 illustrates an accelerator device 1100 for providing runtime fault detection, location and circuit recovery, in accordance with implementations of the disclosure. In some implementations, the accelerator device 1100 could be a GPU, a re-purposed GPU, a re-purposed CPU, an ASIC, or an FPGA, to name a few examples. In one implementation, accelerator device 1100 is an FPGA, which may be the same as FPGA 10 described above with respect to FIGS. 1-7. In one implementation, accelerator device 1100 is the same as accelerator device 1000 described with respect to FIG. 10.

As illustrated, the accelerator device 1100 includes a sensor network 1110 communicably coupled to a sensor aggregator 1120, which is communicably coupled to an SDM 1130. The sensor network 1110 may include a plurality of sensors (such as sensors 1006 described with respect to FIG. 10), dispersed throughout sectors of the accelerator device 1100. In some implementations, sensors of the sensor network 1110 can monitor timing margin (e.g., timing margin monitors (TMMs)).

Implementations of the disclosure further include a high performance point-to-point bus infrastructure within the accelerator device 1100 that is capable of delivering sensor data 1140 (e.g., sensor codewords) to a sensor aggregator 1120. The sensor aggregator 1120 may be implemented as hardware, software, or any combination thereof, such as firmware. The sensor aggregator 1120 provides for a timely system response to the detected faults. The sensor aggregator 1120 can consume sensor codewords and, based on the consumed sensor codewords, spatially locate the fault on the accelerator device 1100. In some implementations, the sensor aggregator provides for thresholding operations to determine when a fault is detected at one or more sensors of the sensor network 1110, or can provide for any other technique for anomaly detection based on data from the sensor network 1110. In one example, another technique for anomaly detection may include performing signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition.
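A thresholding operation combined with a coarse spatial location step might look like the following minimal sketch. The Reading type, the interpretation of a codeword as a numeric timing margin, the per-sector vote, and the min_votes parameter are assumptions for illustration; the disclosure does not specify the codeword format or the location algorithm.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Reading(NamedTuple):
    """Hypothetical decoded sensor codeword."""
    sensor_id: int
    sector_id: int
    margin: float  # assumed timing margin; lower means closer to failure


def detect_and_locate(readings: List[Reading], min_margin: float,
                      min_votes: int = 2) -> Optional[Tuple[int, List[int]]]:
    """Threshold each reading, then pick the sector with the most tripped sensors.

    Returns (sector_id, tripped_sensor_ids) or None when no sector reaches
    min_votes. Requiring several sensors to agree is one simple way a sensor
    network could reduce false positives compared with a single sensor.
    """
    tripped = [r for r in readings if r.margin < min_margin]
    votes = Counter(r.sector_id for r in tripped)
    if not votes:
        return None
    sector, count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    return sector, sorted(r.sensor_id for r in tripped if r.sector_id == sector)
```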

The sensor aggregator 1120 then communicates the detected fault information as events 1170 to an SDM 1130. The events 1170 may include SDM interrupts and general purpose input/output (GPIO) data movement (e.g., sensor ID, sector ID, etc.). In one implementation, SDM 1130 is the same as SDM 650 described with respect to FIG. 6.
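One way such an event could carry a sensor ID and a sector ID is as fields packed into a single word. The layout below (12-bit sensor ID, 12-bit sector ID, 8-bit fault code) and the functions pack_event and unpack_event are purely hypothetical; the disclosure does not define an event format.

```python
from typing import Tuple

# Hypothetical field widths, chosen only for illustration.
SENSOR_BITS = 12
SECTOR_BITS = 12
CODE_BITS = 8


def pack_event(sensor_id: int, sector_id: int, fault_code: int) -> int:
    """Pack the three fields into one 32-bit word, sensor ID in the low bits."""
    for value, bits in ((sensor_id, SENSOR_BITS), (sector_id, SECTOR_BITS),
                        (fault_code, CODE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field out of range")
    return ((fault_code << (SENSOR_BITS + SECTOR_BITS))
            | (sector_id << SENSOR_BITS)
            | sensor_id)


def unpack_event(word: int) -> Tuple[int, int, int]:
    """Inverse of pack_event: returns (sensor_id, sector_id, fault_code)."""
    sensor_id = word & ((1 << SENSOR_BITS) - 1)
    sector_id = (word >> SENSOR_BITS) & ((1 << SECTOR_BITS) - 1)
    fault_code = (word >> (SENSOR_BITS + SECTOR_BITS)) & ((1 << CODE_BITS) - 1)
    return sensor_id, sector_id, fault_code
```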

In implementations of the disclosure, the SDM 1130 can implement a security policy corresponding to detected faults in the accelerator device 1100. An example security policy implemented by the SDM 1130 may include the tear-down of resources containing the fault (e.g., quick kill 1180), and subsequent testing and reallocation of those resources to new users/clients/tenants on the accelerator device 1100. Other options for mitigation by the SDM 1130 may include standard PR operation, sector PR reconfiguration, targeted freeze of memory blocks (such as configuration random access memory (CRAM) of the accelerator device 1100), and so on.
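The example policy of tear-down followed by testing and reallocation can be modeled as a small state transition. The sketch below is an assumption-laden illustration: the SectorState values, the RETIRED outcome for a sector that fails its test, and the self_test callback are not taken from the disclosure.

```python
from enum import Enum
from typing import Callable, Dict


class SectorState(Enum):
    ALLOCATED = "allocated"      # hosting a tenant
    QUARANTINED = "quarantined"  # torn down, awaiting test
    FREE = "free"                # tested, may be reallocated
    RETIRED = "retired"          # assumed outcome when the test fails


def handle_fault(sectors: Dict[int, SectorState], sector_id: int,
                 self_test: Callable[[int], bool]) -> SectorState:
    """Tear down the faulty sector, test it, and decide whether it may be reused."""
    sectors[sector_id] = SectorState.QUARANTINED  # tear-down (quick kill)
    passed = self_test(sector_id)                 # subsequent testing
    sectors[sector_id] = SectorState.FREE if passed else SectorState.RETIRED
    return sectors[sector_id]
```

Only the sector named in the event changes state in this model, so tenants in other sectors keep running.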

The SDM 1130 can further provide for configuration 1160 of sensors of the sensor network 1110. The sensor configuration 1160 from the SDM 1130 may be managed and implemented by the sensor aggregator 1120 via sensor configuration 1150.

FIG. 12 is a flow diagram illustrating a method 1200 for runtime fault detection, fault location, and circuit recovery in an accelerator device, in accordance with implementations of the disclosure. Method 1200 may be performed by processing logic that may comprise hardware (e.g., circuitry, dedicated logic, programmable logic, etc.), software (such as instructions run on a processing device), or a combination thereof. More particularly, the method 1200 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., in configurable logic such as, for example, programmable logic arrays (PLAs), field-programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), in fixed-functionality logic hardware using circuit technology such as, for example, application-specific integrated circuit (ASIC), complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof.

The process of method 1200 is illustrated in linear sequences for brevity and clarity in presentation; however, it is contemplated that any number of the blocks can be performed in parallel, asynchronously, or in different orders. Further, for brevity, clarity, and ease of understanding, many of the components and processes described with respect to FIGS. 10-11 may not be repeated or discussed hereafter. In one implementation, an accelerator device, such as FPGA 1110 described with respect to FIG. 11, may perform method 1200.

Method 1200 begins at block 1210 where the accelerator device may receive, by a sensor aggregator of the accelerator device, sensor data from a sensor network of the accelerator device. At block 1220, the accelerator device may analyze, by the sensor aggregator, the sensor data to detect a fault condition.

Subsequently, at block 1230, the accelerator device may determine, by the sensor aggregator, a spatial location of the fault condition based on the sensor data. Lastly, at block 1240, the accelerator device may generate, by the sensor aggregator, an event for a secure device manager (SDM) of the accelerator device to cause the SDM to mitigate the fault condition. In one implementation, mitigation of the fault condition by the SDM may include at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors.
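The four blocks of method 1200 can be sketched in software as a single pass over one batch of readings. This is a minimal sketch under stated assumptions: readings are treated as timing margins where a lower value is worse, a fixed threshold stands in for the analysis step, and the function name, the event dictionary layout, and the `notify_sdm` callback are hypothetical.

```python
from typing import Callable, Dict, List


def run_method_1200(
    readings: Dict[int, float],
    sensor_to_sector: Dict[int, int],
    min_margin: float,
    notify_sdm: Callable[[dict], None],
) -> List[dict]:
    """One pass over blocks 1210-1240 for a batch of sensor readings."""
    # Block 1210: sensor data has been received (passed in as `readings`,
    # keyed by sensor ID).
    # Block 1220: analyze the data; here a reading below `min_margin`
    # is treated as a fault condition (an assumed criterion).
    faulty = sorted(s for s, margin in readings.items() if margin < min_margin)

    events = []
    for sensor_id in faulty:
        # Block 1230: determine the spatial location as the sector
        # that contains the reporting sensor.
        sector_id = sensor_to_sector[sensor_id]
        # Block 1240: generate an event so the SDM can mitigate the fault.
        event = {"sensor_id": sensor_id, "sector_id": sector_id}
        notify_sdm(event)
        events.append(event)
    return events
```

For example, with readings `{0: 0.9, 1: 0.1, 2: 0.8}`, a sensor-to-sector map `{0: 0, 1: 0, 2: 1}`, and `min_margin=0.5`, only sensor 1 falls below the threshold, so a single event locating the fault in sector 0 is passed to `notify_sdm`.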

The flowcharts discussed above are representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the systems described herein. The machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor. The program may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor, but the whole program and/or parts thereof could alternatively be executed by a device other than the processor and/or embodied in firmware or dedicated hardware. Further, although the example program is described with reference to the flowcharts illustrated in the various figures herein, many other methods of implementing the example computing system may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally, or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational-amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware.

The machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc. Machine readable instructions as described herein may be stored as data (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions. For example, the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers). The machine readable instructions may utilize one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc. in order to make them directly readable, interpretable, and/or executable by a computing device and/or other machine. For example, the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement a program such as that described herein.
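The multi-part storage described above can be illustrated with a short sketch that fragments a program, compresses each part individually, and later recombines the parts. Encryption and distribution across separate devices are omitted here for brevity; the function names and the use of Python's standard `zlib` module are illustrative assumptions.

```python
import zlib
from typing import List


def split_and_compress(program: bytes, parts: int) -> List[bytes]:
    """Fragment `program` into at most `parts` pieces and compress each piece."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Ceiling division gives the fragment size; at least one byte per fragment.
    size = max(1, -(-len(program) // parts))
    return [
        zlib.compress(program[i:i + size])
        for i in range(0, len(program), size)
    ]


def reassemble(fragments: List[bytes]) -> bytes:
    """Decompress each fragment and concatenate them in their original order."""
    return b"".join(zlib.decompress(f) for f in fragments)
```

The fragments must be recombined in their original order, so a real deployment would also need to record that order alongside the parts.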

In another example, the machine readable instructions may be stored in a state in which they may be read by a computer, but utilize addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device. In another example, the machine readable instructions may be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part. Thus, the disclosed machine readable instructions and/or corresponding program(s) are intended to encompass such machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.

The machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc. For example, the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.

As mentioned above, the example processes of FIGS. 5 and/or 6 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.

“Including” and “comprising” (and all forms and tenses thereof) are used herein to be open ended terms. Thus, whenever a claim employs any form of “include” or “comprise” (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase “at least” is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the terms “comprising” and “including” are open ended.

The term “and/or” when used, for example, in a form such as A, B, and/or C refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects and/or things, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A and B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities and/or steps, the phrase “at least one of A or B” is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
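The seven combinations covered by “A, B, and/or C” are exactly the non-empty subsets of the three items. A short sketch that enumerates them (the function name `and_or` is chosen here only for illustration):

```python
from itertools import combinations


def and_or(items):
    """Return every non-empty subset of `items`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]
```

Calling `and_or(["A", "B", "C"])` yields seven tuples in the same order as items (1) through (7) above, from `("A",)` to `("A", "B", "C")`.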

As used herein, singular references (e.g., “a”, “an”, “first”, “second”, etc.) do not exclude a plurality. The term “a” or “an” entity, as used herein, refers to one or more of that entity. The terms “a” (or “an”), “one or more”, and “at least one” can be used interchangeably herein. Furthermore, although individually listed, a plurality of means, elements or method actions may be implemented by, e.g., a single unit or processor. Additionally, although individual features may be included in different examples or claims, these may possibly be combined, and the inclusion in different examples or claims does not imply that a combination of features is not feasible and/or advantageous.

Descriptors “first,” “second,” “third,” etc. are used herein when identifying multiple elements or components which may be referred to separately. Unless otherwise specified or understood based on their context of use, such descriptors are not intended to impute any meaning of priority, physical order or arrangement in a list, or ordering in time but are merely used as labels for referring to multiple elements or components separately for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for ease of referencing multiple elements or components.

The following examples pertain to further embodiments. Example 1 is an apparatus to facilitate runtime fault detection, fault location, and circuit recovery in an accelerator device. The apparatus of Example 1 comprises a sensor network comprising a plurality of sensors; a secure device manager (SDM); and a sensor aggregator communicably coupled to the sensor network and the SDM, the sensor aggregator to: receive sensor data from the sensor network; analyze the sensor data to detect a fault condition; determine a spatial location of the fault condition based on the sensor data; and generate an event for the SDM to cause the SDM to mitigate the fault condition.

In Example 2, the subject matter of Example 1 can optionally include wherein the SDM comprises a configuration manager and security enclave for the apparatus. In Example 3, the subject matter of any one of Examples 1-2 can optionally include wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor. In Example 4, the subject matter of any one of Examples 1-3 can optionally include further comprising a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors.

In Example 5, the subject matter of any one of Examples 1-4 can optionally include wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition. In Example 6, the subject matter of any one of Examples 1-5 can optionally include further comprising a point-to-point bus routing interface to communicably couple the sensor network and the sensor aggregator.
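The two analysis options of Example 5 can be sketched as follows. This is a minimal illustration, assuming the samples are timing-margin readings where lower is worse; the chosen features (mean, minimum, and least-squares slope) and the function names are assumptions, and the machine learning model that would consume the features is not shown.

```python
from statistics import mean


def threshold_fault(samples, threshold):
    """Thresholding: flag a fault if any sample drops below `threshold`."""
    return any(s < threshold for s in samples)


def extract_features(samples):
    """Simple time-series features that could be fed to a model for training."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a slope")
    t_mean = (n - 1) / 2          # mean of the time indices 0..n-1
    s_mean = mean(samples)
    # Least-squares slope of sample value against time index.
    num = sum((t - t_mean) * (s - s_mean) for t, s in enumerate(samples))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return {"mean": s_mean, "min": min(samples), "slope": num / den}
```

A steadily shrinking margin such as `[4.0, 3.0, 2.0, 1.0]` produces a negative slope, which lets a trend be detected before any single sample crosses a fixed threshold.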

In Example 7, the subject matter of any one of Examples 1-6 can optionally include wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors. In Example 8, the subject matter of any one of Examples 1-7 can optionally include wherein the apparatus comprises a hardware accelerator device comprising at least one of a graphics processing unit (GPU), a central processing unit (CPU), or a programmable integrated circuit (IC).

In Example 9, the subject matter of any one of Examples 1-8 can optionally include wherein the programmable IC comprises at least one of a field programmable gate array (FPGA), a programmable array logic (PAL), a programmable logic array (PLA), a field programmable logic array (FPLA), an electrically programmable logic device (EPLD), an electrically erasable programmable logic device (EEPLD), a logic cell array (LCA), or a complex programmable logic device (CPLD).

Example 10 is a method for facilitating runtime fault detection, fault location, and circuit recovery in an accelerator device. The method of Example 10 can include receiving, by a sensor aggregator of a hardware accelerator device, sensor data from a sensor network comprising a plurality of sensors of the hardware accelerator device; analyzing, by the sensor aggregator, the sensor data to detect a fault condition; determining, by the sensor aggregator, a spatial location of the fault condition based on the sensor data; and generating, by the sensor aggregator, an event for a secure device manager (SDM) of the hardware accelerator device to cause the SDM to mitigate the fault condition.

In Example 11, the subject matter of Example 10 can optionally include wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor. In Example 12, the subject matter of any one of Examples 10-11 can optionally include wherein the hardware accelerator device comprises a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors. In Example 13, the subject matter of any one of Examples 10-12 can optionally include wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition.

In Example 14, the subject matter of any one of Examples 10-13 can optionally include wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors. In Example 15, the subject matter of any one of Examples 10-14 can optionally include wherein the apparatus comprises a hardware accelerator device comprising at least one of a graphics processing unit (GPU), a central processing unit (CPU), or a programmable integrated circuit (IC), and wherein the programmable IC comprises at least one of a field programmable gate array (FPGA), a programmable array logic (PAL), a programmable logic array (PLA), a field programmable logic array (FPLA), an electrically programmable logic device (EPLD), an electrically erasable programmable logic device (EEPLD), a logic cell array (LCA), or a complex programmable logic device (CPLD).

Example 16 is a non-transitory machine readable storage medium for facilitating runtime fault detection, fault location, and circuit recovery in an accelerator device. The non-transitory machine readable storage medium of Example 16 has stored thereon executable computer program instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving, by a sensor aggregator of a hardware accelerator device comprising the one or more processors, sensor data from a sensor network comprising a plurality of sensors of the hardware accelerator device; analyzing, by the sensor aggregator, the sensor data to detect a fault condition; determining, by the sensor aggregator, a spatial location of the fault condition based on the sensor data; and generating, by the sensor aggregator, an event for a secure device manager (SDM) of the hardware accelerator device to cause the SDM to mitigate the fault condition.

In Example 17, the subject matter of Example 16 can optionally include wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor. In Example 18, the subject matter of any one of Examples 16-17 can optionally include wherein the hardware accelerator device comprises a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors.

In Example 19, the subject matter of any one of Examples 16-18 can optionally include wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition. In Example 20, the subject matter of any one of Examples 16-19 can optionally include wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors.

Example 21 is a system for facilitating runtime fault detection, fault location, and circuit recovery in an accelerator device. The system of Example 21 can optionally include a memory, a sensor network comprising a plurality of sensors, a secure device manager (SDM), and a sensor aggregator communicably coupled to the memory, the sensor network, and the SDM. The sensor aggregator of the system of Example 21 can be configured to receive sensor data from the sensor network; analyze the sensor data to detect a fault condition; determine a spatial location of the fault condition based on the sensor data; and generate an event for the SDM to cause the SDM to mitigate the fault condition.

In Example 22, the subject matter of Example 21 can optionally include wherein the SDM comprises a configuration manager and security enclave for the apparatus. In Example 23, the subject matter of any one of Examples 21-22 can optionally include wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor. In Example 24, the subject matter of any one of Examples 21-23 can optionally include further comprising a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors.

In Example 25, the subject matter of any one of Examples 21-24 can optionally include wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition. In Example 26, the subject matter of any one of Examples 21-25 can optionally include further comprising a point-to-point bus routing interface to communicably couple the sensor network and the sensor aggregator.

In Example 27, the subject matter of any one of Examples 21-26 can optionally include wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors. In Example 28, the subject matter of any one of Examples 21-27 can optionally include wherein the apparatus comprises a hardware accelerator device comprising at least one of a graphics processing unit (GPU), a central processing unit (CPU), or a programmable integrated circuit (IC).

In Example 29, the subject matter of any one of Examples 21-28 can optionally include wherein the programmable IC comprises at least one of a field programmable gate array (FPGA), a programmable array logic (PAL), a programmable logic array (PLA), a field programmable logic array (FPLA), an electrically programmable logic device (EPLD), an electrically erasable programmable logic device (EEPLD), a logic cell array (LCA), or a complex programmable logic device (CPLD).

Example 30 is an apparatus for facilitating runtime fault detection, fault location, and circuit recovery in an accelerator device according to implementations of the disclosure. The apparatus of Example 30 can comprise means for receiving, by a sensor aggregator of a hardware accelerator device, sensor data from a sensor network comprising a plurality of sensors of the hardware accelerator device; means for analyzing the sensor data to detect a fault condition; means for determining a spatial location of the fault condition based on the sensor data; and means for generating an event for a secure device manager (SDM) of the hardware accelerator device to cause the SDM to mitigate the fault condition.

In Example 31, the subject matter of Example 30 can optionally include the apparatus further configured to perform the method of any one of the Examples 11 to 15.

Example 32 is at least one machine readable medium comprising a plurality of instructions that in response to being executed on a computing device, cause the computing device to carry out a method according to any one of Examples 10-15. Example 33 is an apparatus for facilitating runtime fault detection, fault location, and circuit recovery in an accelerator device, configured to perform the method of any one of Examples 10-15. Example 34 is an apparatus for facilitating runtime fault detection, fault location, and circuit recovery in an accelerator device comprising means for performing the method of any one of Examples 10-15. Specifics in the Examples may be used anywhere in one or more embodiments.

The foregoing description and drawings are to be regarded in an illustrative rather than a restrictive sense. Persons skilled in the art can understand that various modifications and changes may be made to the embodiments described herein without departing from the broader spirit and scope of the features set forth in the appended claims.

What is claimed is:
 1. An apparatus comprising: a sensor network comprising a plurality of sensors; a secure device manager (SDM); and a sensor aggregator communicably coupled to the sensor network and the SDM, the sensor aggregator to: receive sensor data from the sensor network; analyze the sensor data to detect a fault condition; determine a spatial location of the fault condition based on the sensor data; and generate an event for the SDM to cause the SDM to mitigate the fault condition.
 2. The apparatus of claim 1, wherein the SDM comprises a configuration manager and security enclave for the apparatus.
 3. The apparatus of claim 1, wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor.
 4. The apparatus of claim 1, further comprising a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors.
 5. The apparatus of claim 1, wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition.
 6. The apparatus of claim 1, further comprising a point-to-point bus routing interface to communicably couple the sensor network and the sensor aggregator.
 7. The apparatus of claim 1, wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors.
 8. The apparatus of claim 1, wherein the apparatus comprises a hardware accelerator device comprising at least one of a graphics processing unit (GPU), a central processing unit (CPU), or a programmable integrated circuit (IC).
 9. The apparatus of claim 8, wherein the programmable IC comprises at least one of a field programmable gate array (FPGA), a programmable array logic (PAL), a programmable logic array (PLA), a field programmable logic array (FPLA), an electrically programmable logic device (EPLD), an electrically erasable programmable logic device (EEPLD), a logic cell array (LCA), or a complex programmable logic device (CPLD).
 10. A method comprising: receiving, by a sensor aggregator of a hardware accelerator device, sensor data from a sensor network comprising a plurality of sensors of the hardware accelerator device; analyzing, by the sensor aggregator, the sensor data to detect a fault condition; determining, by the sensor aggregator, a spatial location of the fault condition based on the sensor data; and generating, by the sensor aggregator, an event for a secure device manager (SDM) of the hardware accelerator device to cause the SDM to mitigate the fault condition.
 11. The method of claim 10, wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor.
 12. The method of claim 10, wherein the hardware accelerator device comprises a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors.
 13. The method of claim 10, wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition.
 14. The method of claim 10, wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors.
 15. The method of claim 10, wherein the apparatus comprises a hardware accelerator device comprising at least one of a graphics processing unit (GPU), a central processing unit (CPU), or a programmable integrated circuit (IC), and wherein the programmable IC comprises at least one of a field programmable gate array (FPGA), a programmable array logic (PAL), a programmable logic array (PLA), a field programmable logic array (FPLA), an electrically programmable logic device (EPLD), an electrically erasable programmable logic device (EEPLD), a logic cell array (LCA), or a complex programmable logic device (CPLD).
 16. A non-transitory machine readable storage medium comprising instructions that, when executed, cause at least one processor to at least: receive, by a sensor aggregator of a hardware accelerator device comprising the at least one processor, sensor data from a sensor network comprising a plurality of sensors of the hardware accelerator device; analyze, by the sensor aggregator, the sensor data to detect a fault condition; determine, by the sensor aggregator, a spatial location of the fault condition based on the sensor data; and generate, by the sensor aggregator, an event for a secure device manager (SDM) of the hardware accelerator device to cause the SDM to mitigate the fault condition.
 17. The non-transitory machine readable storage medium of claim 16, wherein each sensor of the plurality of sensors in the sensor network comprises a timing margin monitor.
 18. The non-transitory machine readable storage medium of claim 16, wherein the hardware accelerator device comprises a plurality of sectors, wherein each sector comprises a portion of the plurality of sensors of the sensor network, and wherein the spatial location is to reference at least one of the plurality of sectors.
 19. The non-transitory machine readable storage medium of claim 16, wherein the sensor aggregator to process the sensor data further comprises the sensor aggregator to at least one of: apply a thresholding operation to the sensor data to detect the fault condition, or perform signal processing of a time-series of the sensor data to extract features that are used to train a machine learning model to identify the fault condition.
 20. The non-transitory machine readable storage medium of claim 16, wherein the SDM to mitigate the fault condition by at least one of tearing down a partial reconfiguration (PR) persona of a sector of the apparatus, reconfiguring a sector PR of a sector of the apparatus corresponding to the detected fault, performing a targeted freeze of a memory block of a sector of the apparatus corresponding to the detected fault, or power-gating at least one of the plurality of sensors.